TD - Chapter details v2

Chapter details

Overview

ATLAS has a module for automatic categorization of documents, which provides access to several statistical and distance-based algorithms. The categorization module instantiates the configured algorithms with different feature types; it can also run several algorithms simultaneously and combine the results of the individual classifiers.

Implementation

The module registers one or more algorithms as OSGi services, according to the configuration settings. ATLAS uses these services for the categorization tasks that users initiate.

The plugin com.tetracom.atlas.textmining.categorization.algorithms contains implementations of the different categorization algorithms available in ATLAS.

The file com.tetracom.atlas.textmining.categorization.algorithms.properties contains the configuration settings for the automatic categorization module. The file has the following format:

The class CategorizationAlgorithmsProviderService reads the configuration settings, creates instances of the categorization algorithms, and registers them as OSGi services.

Each categorization algorithm implements the ISpecificAutomaticCategorizationService interface. CategorizationAlgorithmFactory uses three parameters to create ISpecificAutomaticCategorizationService instances: name, feature.type and feature.reduction.

Possible options for the name parameter are:

naive_bayesian (Naïve Bayesian Algorithm)
relative_entropy (Relative Entropy Algorithm)
cfc (Class-Featured Centroid Algorithm)
cfcmodif (Modified Class-Featured Centroid Algorithm)

Possible options for the feature.type parameter are:

token (Features based on tokens)
lemma (Features based on lemmas)
np (Features based on noun phrases)
headtokens (Features based on head tokens)

Possible options for the feature.reduction parameter are:

none (Feature reduction is not applied)
top_x (where x is a number between 1 and 100 – the percentage of features to keep)
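
As an illustration of how the two feature.reduction values could be interpreted, here is a minimal sketch. The class FeatureReductionSetting and its method are hypothetical and not part of ATLAS; treating none as "keep 100 percent" is an assumption made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an ATLAS class): interprets a feature.reduction value. */
final class FeatureReductionSetting {
    // Matches "top_" followed by one to three digits, e.g. "top_30".
    private static final Pattern TOP_X = Pattern.compile("top_(\\d{1,3})");

    /** Returns the percentage of features to keep; "none" is assumed to mean 100. */
    static int percentToKeep(String value) {
        if (value == null) {
            throw new IllegalArgumentException("feature.reduction is not set");
        }
        if ("none".equals(value)) {
            return 100;
        }
        Matcher m = TOP_X.matcher(value);
        if (m.matches()) {
            int percent = Integer.parseInt(m.group(1));
            // The documented range for x is 1 to 100.
            if (percent >= 1 && percent <= 100) {
                return percent;
            }
        }
        throw new IllegalArgumentException("Unsupported feature.reduction value: " + value);
    }
}
```

With this sketch, top_30 yields 30, while values outside the documented range, such as top_0, are rejected.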

Based on these parameters, CategorizationAlgorithmFactory returns a new CategorizationAlgorithm object, which is constructed with the corresponding IFeatureSpaceReducer, ICategoryVectorCreator and IDocsClassifier instances.

Each ISpecificAutomaticCategorizationService generates its own algorithm identifier by joining its three parameters with underscores: algorithmIdentifier = name _ feature.type _ feature.reduction.

The application uses this identifier to distinguish between the different algorithm instances.
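
The identifier scheme can be sketched as follows. The class AlgorithmIdentifiers is hypothetical, and the sketch assumes the three parts are joined by a single underscore with no surrounding spaces.

```java
/** Hypothetical helper (not an ATLAS class): builds an algorithm identifier. */
final class AlgorithmIdentifiers {
    /** Joins name, feature.type and feature.reduction with underscores. */
    static String of(String name, String featureType, String featureReduction) {
        return String.join("_", name, featureType, featureReduction);
    }
}
```

For example, AlgorithmIdentifiers.of("cfc", "lemma", "top_30") gives cfc_lemma_top_30. Because some parameter values contain underscores themselves (naive_bayesian, top_30), such an identifier cannot be split back into its parts by underscore alone; it is best treated as an opaque key.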

The following sequence of actions is executed in the CategorizationAlgorithm.createModel method:

Document features (according to the feature.type parameter) for all of the training documents are fetched from the data store.
The feature space for the training set is built.
The feature space is reduced by the IFeatureSpaceReducer.
The training document features are normalized.
ICategoryVectorCreator creates the category vectors.
The category vectors are normalized.
A new instance of BinaryModel is returned as a result.
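
The createModel steps above can be sketched in simplified form. Everything in this example is an assumption made for illustration: documents are passed in as sparse feature maps instead of being fetched from the data store, the reducer keeps the top percentage of features by total weight, normalization is Euclidean (L2), category vectors are built by summing document vectors, and a plain map stands in for BinaryModel. The real IFeatureSpaceReducer and ICategoryVectorCreator implementations may work differently.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical sketch of the createModel pipeline; not the ATLAS implementation. */
final class CreateModelSketch {

    /** Scales a sparse vector to unit Euclidean length (assumed normalization). */
    static Map<String, Double> normalize(Map<String, Double> vector) {
        double norm = Math.sqrt(vector.values().stream().mapToDouble(v -> v * v).sum());
        Map<String, Double> result = new HashMap<>();
        if (norm == 0.0) {
            return result; // an all-zero vector stays empty
        }
        vector.forEach((feature, weight) -> result.put(feature, weight / norm));
        return result;
    }

    /** Stand-in for IFeatureSpaceReducer: keeps the top percentage of features by total weight. */
    static Set<String> reduce(Map<String, Double> featureSpace, int percentToKeep) {
        long keep = Math.max(1, Math.round(featureSpace.size() * percentToKeep / 100.0));
        return featureSpace.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Double>comparingByKey()))
                .limit(keep)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    /** docs.get(i) is the feature vector of training document i; labels.get(i) is its category. */
    static Map<String, Map<String, Double>> createModel(
            List<Map<String, Double>> docs, List<String> labels, int percentToKeep) {
        // Build the feature space as the total weight of every feature.
        Map<String, Double> featureSpace = new HashMap<>();
        for (Map<String, Double> doc : docs) {
            doc.forEach((f, w) -> featureSpace.merge(f, w, Double::sum));
        }
        // Reduce the feature space.
        Set<String> kept = reduce(featureSpace, percentToKeep);
        // Normalize each reduced document and add it to its category vector.
        Map<String, Map<String, Double>> categoryVectors = new HashMap<>();
        for (int i = 0; i < docs.size(); i++) {
            Map<String, Double> reduced = new HashMap<>(docs.get(i));
            reduced.keySet().retainAll(kept);
            Map<String, Double> target =
                    categoryVectors.computeIfAbsent(labels.get(i), c -> new HashMap<>());
            normalize(reduced).forEach((f, w) -> target.merge(f, w, Double::sum));
        }
        // Normalize the category vectors.
        categoryVectors.replaceAll((category, vector) -> normalize(vector));
        return categoryVectors;
    }
}
```

The sketch returns one unit-length vector per category, keyed by a category label.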

The following actions are executed in the CategorizationAlgorithm.useModel method:

Document features (according to the feature.type parameter) for all of the unlabeled documents (test documents) are fetched from the data store.
All test documents that do not have features from the model space are ignored.
The test document features are normalized.
IDocsClassifier classifies the documents.
The classification results are returned as Map< UUID/*doc uid*/, Map< UUID/*cat uid*/, Double/*relevance*/ >>.
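
A simplified sketch of the useModel steps, producing the nested result map described above. The class UseModelSketch and its parameters are hypothetical. Relevance is computed here as the dot product between the L2-normalized document and category vectors that are assumed to be unit-length already (that is, cosine similarity), which stands in for whatever IDocsClassifier actually computes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

/** Hypothetical sketch of the useModel steps; not the ATLAS implementation. */
final class UseModelSketch {

    /** Returns doc uid -> (category uid -> relevance) for every classifiable document. */
    static Map<UUID, Map<UUID, Double>> useModel(
            Map<UUID, Map<String, Double>> testDocs,
            Map<UUID, Map<String, Double>> categoryVectors,
            Set<String> modelFeatures) {
        Map<UUID, Map<UUID, Double>> results = new HashMap<>();
        testDocs.forEach((docUid, features) -> {
            // Keep only the features that belong to the model space.
            Map<String, Double> inSpace = new HashMap<>(features);
            inSpace.keySet().retainAll(modelFeatures);
            if (inSpace.isEmpty()) {
                return; // documents without model features are ignored
            }
            // Normalize the document vector (assumed: Euclidean length).
            double norm = Math.sqrt(
                    inSpace.values().stream().mapToDouble(v -> v * v).sum());
            if (norm == 0.0) {
                return;
            }
            // Score the document against every category vector.
            Map<UUID, Double> relevance = new HashMap<>();
            categoryVectors.forEach((catUid, vector) -> {
                double dot = 0.0;
                for (Map.Entry<String, Double> e : inSpace.entrySet()) {
                    dot += (e.getValue() / norm) * vector.getOrDefault(e.getKey(), 0.0);
                }
                relevance.put(catUid, dot);
            });
            results.put(docUid, relevance);
        });
        return results;
    }
}
```

A caller could then pick, for each document, the category with the highest relevance, or apply a threshold; which of these ATLAS does is not specified here.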
